ABSTRACT

Classifiers  for  surveillance  sonar  systems  are  often  designed 
to  operate  on  large  sets  of  predefined  clues,  or  features. 
Sometimes  the  mathematical  definitions  for  these  features 
are  poorly  known.  Other  times  the  designer  is  not  aware  that 
a  fixed  and  class-independent  linear  (or  affine)  relationship 
exists  between  subsets  of  features.  We  discuss  a  method 
based  on  Gram-Schmidt  orthogonalization  which  allows  the 
classifier  designer  to  determine  whether  subsets  of  features 
have  such  relationships.  Certain  features  can  then  be  shown 
unnecessary by application of Wozencraft and Jacobs’ “Theorem
of Irrelevance”. An approach is also described to rank
features  to  aid  in  the  selection  of  an  effective  subset. 

I.  INTRODUCTION 

The  design  of  classifiers  for  a  surveillance  sonar  system 
is  often  based  on  a  pre-existing  set  of  measurements  or 
“features” considered useful for discrimination. This set
can  be  large  and  usually  includes  features  that  are  simple 
functions  of  basic  physical  measurements  of  the  objects  to 
be  classified.  Features  are  often  defined  via  linear  functions 
of  fundamental  measurements  or  other  more  basic  features. 
However, these functional relationships are sometimes unknown
to the designer or difficult to sort out.

Denote as $Z$ the raw measurements collected from an
object to be classified. Features contained in the vector $f(Z)$
are said to be affinely dependent on another set of features
contained in the vector $g(Z)$ if $f(Z) = b + A\,g(Z)$, $\forall Z$. In
other words, there exist a fixed vector $b$ and matrix $A$ such
that $f(Z)$ is given by the above relationship regardless of
the object type (class). Examples include:

(1) Let $g(Z)$ consist of a single feature such as an
estimate of the angular width of an object, and let $f(Z)$
be the cross-range extent measured by a sonar system
corresponding to that object at a fixed (non-random)
range $R$. Then $g(Z) = \Delta\theta$ and $f(Z) = R\,\Delta\theta$. Certain
classes of objects may possess greater angular widths
than others. However, feature $f(Z)$ is completely
defined by $g(Z)$.

(2) Let $g(Z)$ and $f(Z)$ be estimated probabilities of an
object undergoing acceleration or maintaining constant
velocity, respectively, at time $t$. Such estimates are
provided by tracking algorithms commonly used in
sonar. Note that $g(Z) + f(Z) = 1$. It may be that
certain classes of objects perform more maneuvers
than others and will exhibit this in terms of higher
values of $g(Z)$ and lower values of $f(Z)$. However,
$f(Z)$ is completely defined by $g(Z)$.

In  such  cases  the  dependence  may  have  been  known  to  the 
classifier  designer,  but  in  real  cases  with  large  feature  sets 
the  dependence  may  be  unknown  or  difficult  to  determine. 
Furthermore,  in  real  problems,  more  complicated  examples 
of  affine  dependence  can  and  do  arise  in  obscure  and 
inadvertent  ways. 

We are interested in detecting whether any such affine
relationships are present among features. In this short paper
we describe a nonparametric method for detecting affinely
dependent features that is based on the linear algebraic
structure of the feature space and is independent of the underlying
distributions as well as separability among classes. Removal
of  such  dependent  features  is  argued  through  application 
of  Wozencraft  and  Jacobs’  “Theorem  of  Irrelevance”  [1]. 
Removal  of  such  features  is  critical  as  they  add  no  additional 
information  and  unnecessarily  increase  the  dimensionality  of 
the  feature  space. 

“Approximate”  affine  dependence  can  also  be  identified 
and  a  ranking  of  the  features  is  possible.  The  pre-processing 
advocated  herein  is  a  simple  but  often  overlooked  first  step 
in  the  design  of  sonar  classifiers.  The  method  has  proved 
useful  for  detecting  and  removing  affinely  dependent  features 
present  in  Navy  feature  databases. 

II.  “THEOREM  OF  IRRELEVANCE” 

Denote the class-conditional pdf for class $i$ as $p_r(\rho \mid i)$,
where $r$ is the random vector of features and $\rho$ is a particular
realization of $r$. Then decompose $r$ as $r = [r_1\ r_2]^T$. Denote
$\rho_1$ and $\rho_2$ as the corresponding realizations of $r_1$ and
$r_2$. We observe that $p_r(\rho \mid i) = p_{r_1}(\rho_1 \mid i)\, p_{r_2}(\rho_2 \mid i, r_1 = \rho_1)$
by Bayes' rule. Wozencraft and Jacobs’ “Theorem of
Irrelevance” states that the optimum classifier/dichotomizer
may disregard a vector $r_2$ if and only if

$$p_{r_2}(\rho_2 \mid i, r_1 = \rho_1) = p_{r_2}(\rho_2 \mid r_1 = \rho_1) \tag{1}$$

In other words, $r_2$ conditioned on $r_1$ must be statistically
independent of $i$ for $r_2$ to be declared “irrelevant”. Another
useful interpretation is that satisfaction of this requirement
implies that $r_1$ is a sufficient statistic for estimating the pdfs
$p_r(\rho \mid i)$, $\forall i$, although not necessarily minimal [2], [3], [4].
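
As a worked illustration of condition (1), consider the affine relationship $f(Z) = b + A\,g(Z)$ from the Introduction, with $r_1$ standing for $g(Z)$ and $r_2$ for $f(Z)$. Writing the degenerate conditional density with a Dirac delta is an assumption of this sketch, not notation taken from [1]:

```latex
p_{r_2}(\rho_2 \mid i,\, r_1 = \rho_1)
  = \delta(\rho_2 - b - A\rho_1)
  = p_{r_2}(\rho_2 \mid r_1 = \rho_1)
```

Because $b$ and $A$ are fixed and class-independent, the class index $i$ drops out of the conditional density, so (1) holds and $r_2$ may be disregarded.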

III.  DETECTING  AFFINELY  DEPENDENT
FEATURES

The approach involves construction of a data matrix $X$
in which all available data are included. If $C$ classes are
to be distinguished and $N_i$ ($1 \times D$) real-valued vector
measurements (i.e., feature vectors) are available for class
$i = 1, \ldots, C$, then $X$ is an $N \times D$ data matrix where each row is
a feature vector and $N = \sum_{i=1}^{C} N_i$. It is assumed that
$N > D$, a reasonable assumption as there are commonly
more measurements than features. An unsupervised situation
in which $N$ unlabeled feature vectors are presented can also
be considered.

Denote the $D$ features by $f_1, f_2, \ldots, f_D$. We are interested
in affine dependence of the form:

$$f_k = \sum_{j=1,\, j \neq k}^{D} a_j f_j + b_k \tag{2}$$

where $b_k$ need not equal zero. Note that the linear algebraic
concept of linear dependence among a set of vectors does not
accommodate affine dependence. Thus a transformation such
as standardization [5] must be performed before standard
linear algebraic techniques can be applied. This amounts to
estimating the sample mean $\hat{\mu}$ and sample standard deviation
$\hat{\sigma}$ of each column of $X$ and transforming every element $x$
in that column according to $\tilde{x} = (x - \hat{\mu})/\hat{\sigma}$. Denote $X$ after
standardization as $\tilde{X}$. The transformed features are denoted
$\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_D$. Any affine dependence ($b_k \neq 0$ in eqn. (2))
is eliminated, i.e.

$$\tilde{f}_k = \sum_{j=1,\, j \neq k}^{D} \tilde{a}_j \tilde{f}_j \tag{3}$$

(see appendix). It is clear that the elements of $\tilde{X}$ will be
approximately zero mean with unit variance. Thus standardization
has the added effect of controlling the dynamic range
of the elements of $\tilde{X}$ and numerically stabilizing
Gram-Schmidt orthogonalization.
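
The effect of standardization can be sketched numerically. The following NumPy example uses made-up data (the coefficients, offset, sample size and the `ddof=1` convention for the sample standard deviation are all assumptions of the sketch) to show that a column with a nonzero offset $b_k$ is not linearly dependent on the others until the columns are standardized:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N = 200 measurements of two independent features.
N = 200
f1 = rng.normal(size=N)
f2 = rng.normal(size=N)
# Third feature is affinely dependent on the first two (offset b_k = 5).
f3 = 2.0 * f1 - 3.0 * f2 + 5.0
X = np.column_stack([f1, f2, f3])


def standardize(X):
    """Subtract each column's sample mean, divide by its sample std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)  # ddof=1 is an assumed convention
    return (X - mu) / sigma


X_std = standardize(X)

# The offset keeps the raw columns linearly independent; after
# standardization the third column lies in the span of the first two.
rank_raw = np.linalg.matrix_rank(X)
rank_std = np.linalg.matrix_rank(X_std)
print(rank_raw, rank_std)
```

The drop in rank after standardization is what the QR step described next exploits.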

Gram-Schmidt orthogonalization can be performed on the
columns of $\tilde{X}$ via QR matrix decomposition, i.e., $\tilde{X} = QR$.
The linearly dependent column vectors in $\tilde{X}$ are marked by
zeros along the diagonal of $R$.
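
A minimal sketch of this detection step follows. It relies on `numpy.linalg.qr`, which uses Householder reflections rather than classical Gram-Schmidt; the helper name `dependent_columns`, the synthetic data and the deliberately loose tolerance `1e-10` are assumptions made for illustration.

```python
import numpy as np


def dependent_columns(X, tol=1e-10):
    """Indices k whose |R_kk| is numerically zero after standardization."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.linalg.qr(X_std, mode="r")  # upper-triangular factor only
    diag = np.abs(np.diag(R))
    return [k for k, r in enumerate(diag) if r < tol]


rng = np.random.default_rng(1)
N = 300
a = rng.normal(size=N)
b = rng.normal(size=N)
c = rng.normal(size=N)

# Columns 2 and 4 are affine functions of columns that precede them.
X = np.column_stack([a, b, 4.0 * a + 1.5, c, a - b + c - 2.0])

flagged = dependent_columns(X)
print(flagged)
```

Only the columns that depend on columns to their left are flagged, which is why the column ordering matters later in this section.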

If $R_{kk}$ is the leftmost zero element along the diagonal
of $R$ then $\tilde{x}_k$ is linearly dependent on the vectors
$\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{k-1}$ and there exists a set of coefficients,
$\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_{k-1}$, such that $\tilde{x}_k$ can be expressed as:

$$\tilde{x}_k = \tilde{a}_1 \tilde{x}_1 + \tilde{a}_2 \tilde{x}_2 + \cdots + \tilde{a}_{k-1} \tilde{x}_{k-1} \tag{4}$$

Note that at least one (but not necessarily all) of the $\tilde{a}_j$ need be nonzero.
Since this dependence is enforced over all $C$ classes and for
all $N$ measurements, the probability that such a dependence
manifested due to chance is effectively zero. It is worth
mentioning that the dependence is not linked to the manner
in which $X$ was populated. Specifically, a permutation of
the rows of $X$ (equivalently $\tilde{X}$) will not alter the linear
dependence. This can be seen by noting that pre-multiplying
$\tilde{x}_k$ in (4) by a permutation matrix results in

$$u_k = \tilde{a}_1 u_1 + \tilde{a}_2 u_2 + \cdots + \tilde{a}_{k-1} u_{k-1} \tag{5}$$

where $u_j$, $j = 1, \ldots, D$, is the permuted $\tilde{x}_j$.

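Thus feature $f_k$ must be expressible as

$$f_k = a_1 f_1 + a_2 f_2 + \cdots + a_{k-1} f_{k-1} + b_k \tag{6}$$

or

$$f_k = \sum_{j=1,\, j \neq k}^{D} a_j f_j + b_k \tag{7}$$

where $a_j$, $j > k$, must equal 0. If $R_{mm}$ ($k < m \le D$)
is the next zero element along the diagonal of $R$ then $\tilde{x}_m$
is linearly dependent on the vectors $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{m-1}$ and
there exists another set of coefficients, $\tilde{a}_1^m, \tilde{a}_2^m, \ldots, \tilde{a}_{m-1}^m$,
such that $\tilde{x}_m$ can be expressed as:

$$\tilde{x}_m = \tilde{a}_1^m \tilde{x}_1 + \tilde{a}_2^m \tilde{x}_2 + \cdots + \tilde{a}_k^m \tilde{x}_k + \cdots + \tilde{a}_{m-1}^m \tilde{x}_{m-1} \tag{8}$$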


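Once again not all $\tilde{a}_j^m$ need be nonzero. Furthermore, since
$\tilde{x}_k$ is linearly dependent on $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{k-1}$, $\tilde{a}_k^m$ can be set
equal to zero. Thus $f_m$ can be expressed as

$$f_m = \sum_{j=1,\, j \neq m}^{D} a_j^m f_j + b_m \tag{9}$$

where $a_j^m$ must equal zero for $j = k$ as well as for $j > m$.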

Therefore the set of features $r_2$ for which $R_{jj} = 0$ are
affinely dependent on the remainder of the features, denoted
$r_1$. From eqns. (7) and (9), we can see that $\rho_2$ is completely
determined by $\rho_1$ independent of class $i$. Thus eqn. (1) is
satisfied and $r_2$ is irrelevant. It is worth noting that if $N = D$
and $X$ is full rank then $\tilde{X}$ has a rank of $D - 1$, because
subtracting the column means forces the rows of $\tilde{X}$ to sum
to the zero vector. Thus this process can be successfully applied
only if $N$ is strictly greater than $D$.
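
The rank loss in the square case can be checked directly. This short NumPy sketch uses a random $6 \times 6$ matrix; the size and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
X = rng.normal(size=(D, D))  # N = D; full rank with probability one

# Standardize each column (sample mean and sample standard deviation).
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

rank_X = np.linalg.matrix_rank(X)
rank_std = np.linalg.matrix_rank(X_std)

# Each standardized column sums to zero, so the rows are dependent.
column_sums_vanish = np.allclose(X_std.sum(axis=0), 0.0)
print(rank_X, rank_std, column_sums_vanish)
```

With $N = D$ the last diagonal element of $R$ would therefore vanish even when no feature is affinely dependent on the others, which is why $N > D$ is required.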

Even if an exact affine dependence exists, errors in
machine calculation of $R$ can lead to non-zero $R_{jj}$. Higham
[6, p. 24,122] has shown that performing the QR decomposition
via a sequence of Givens rotations leads to an
ultimate relative error that is acceptable and on the level of
machine precision $u$. Here relative error, defined as
$\|R - \hat{R}\|_F / \|\tilde{X}\|_F$, is the normalized difference between the exact
$R$ and $\hat{R}$, estimated by the actual decomposition. Note that

$$\frac{\|R - \hat{R}\|_F}{\|\tilde{X}\|_F} \approx \frac{D\epsilon}{\sqrt{ND}} \tag{10}$$

where $\epsilon$ is the average value of $|R_{ij} - \hat{R}_{ij}|$ ($i, j = 1, \ldots, D$).
Thus if $N \approx 100D$, $R_{jj}$ can be compared to a threshold of
ten times machine precision to detect linear dependence.
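
The factor of ten follows from (10) by simple arithmetic, sketched here under the assumption that the relative error is on the order of $u$:

```latex
\frac{D\epsilon}{\sqrt{ND}} = \epsilon\sqrt{\frac{D}{N}},
\qquad
N \approx 100D
\;\Rightarrow\;
\epsilon\sqrt{\frac{D}{N}} \approx \frac{\epsilon}{10} \approx u
\;\Rightarrow\;
\epsilon \approx 10\,u
```

A typical entry of $\hat{R}$ is thus expected to differ from its exact value by roughly $10u$, which motivates comparing $|R_{jj}|$ against that level.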

If certain features are preferred by the designer (e.g., considered
more intuitive or powerful), reordering can be
performed such that these populate the leftmost columns of
$X$. This step increases the likelihood that such a feature will
be retained in the event that it is affinely dependent on other
features.
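
The influence of column order can be illustrated with the angular-width example from the Introduction. In this sketch the feature names, the fixed range of 1000, the tolerance and the helper `flagged_features` are hypothetical choices; `numpy.linalg.qr` again stands in for Gram-Schmidt.

```python
import numpy as np


def flagged_features(X, names, tol=1e-10):
    """Names of columns whose |R_kk| is numerically zero."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.linalg.qr(X_std, mode="r")
    diag = np.abs(np.diag(R))
    return [names[k] for k, r in enumerate(diag) if r < tol]


rng = np.random.default_rng(3)
width = rng.uniform(0.01, 0.1, size=250)  # hypothetical angular width
extent = 1000.0 * width                   # cross-range extent at R = 1000
speed = rng.normal(size=250)              # unrelated third feature

names = ["width", "extent", "speed"]
X = np.column_stack([width, extent, speed])

# Original order: "extent" follows "width", so "extent" is flagged.
flag_a = flagged_features(X, names)

# Preferred feature "extent" moved leftmost: now "width" is flagged.
order = [1, 0, 2]
flag_b = flagged_features(X[:, order], [names[i] for i in order])
print(flag_a, flag_b)
```

Whichever member of a dependent pair appears first is retained, so placing preferred features in the leftmost columns protects them from removal.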

IV.  “APPROXIMATELY”  AFFINELY  DEPENDENT 
FEATURES 

At this point let us assume that features that are exactly
affinely dependent have been detected and removed. Operating
on the resultant set, the process of Gram-Schmidt
orthogonalization of $\tilde{x}_k$ amounts to

(1) Determining the least-squares fit of $\tilde{x}_k$ in the subspace
spanned by $\{\tilde{x}_1, \ldots, \tilde{x}_{k-1}\}$. This is equivalent to
determining the projection of $\tilde{x}_k$ onto this subspace.
We denote this projection as $P\tilde{x}_k$ where $P$ is the
projection matrix.

(2) Subtracting the result from $\tilde{x}_k$ to form the orthogonal
error vector $e$. Thus $\tilde{x}_k = P\tilde{x}_k + e$. The norm of the
error vector, $\|e\|$, is minimized in this process [3,
p. 365] but is never zero.

Again we observe that permuting the rows of $X$ (equivalently
$\tilde{X}$) only permutes the elements of $e$ but leaves $\|e\|$
unchanged.
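
The two steps and the permutation property can be sketched with a least-squares solve. The data below are synthetic and unstandardized for brevity, and the names `prev`, `xk` and `residual` are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n_prev = 150, 3
prev = rng.normal(size=(N, n_prev))  # stands in for x_1 .. x_{k-1}
xk = prev @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=N)


def residual(prev, xk):
    """Error vector e = xk - P xk, P projecting onto span(prev)."""
    coef, *_ = np.linalg.lstsq(prev, xk, rcond=None)
    return xk - prev @ coef


e = residual(prev, xk)
norm_e = np.linalg.norm(e)

# The last diagonal element of R has magnitude ||e||.
R = np.linalg.qr(np.column_stack([prev, xk]), mode="r")
r_kk = abs(R[-1, -1])

# Permuting the rows permutes e but leaves its norm unchanged.
perm = rng.permutation(N)
e_perm = residual(prev[perm], xk[perm])
print(norm_e, r_kk, np.linalg.norm(e_perm))
```

The agreement between $\|e\|$ and $|R_{kk}|$ is what ties the diagonal of $R$ to the least-squares interpretation used in the remainder of this section.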

The relationship of Gram-Schmidt orthogonalization to
least-squares estimation is important in that, for $N > D$,
we can interpret feature $f_k$ as a sum of A) a fixed (and
class-independent) linear function of the previous features
and B) a residual. It is possible that some (or all) of the
previous features may be useful for classification. It is also
possible that the least-squares estimate of $f_k$ in terms of the
previous features is useful as well. However, this information
is implicit in the previous features. Thus it is sufficient to
determine the added information captured in the residual.
This amounts to a statistical test (or a set of pairwise tests
when more than two classes are considered) comparing the
scalar values in $e$ corresponding to class $i$ with those
corresponding to class $j$. Nonparametric tests such as a
Chi-squared test (or Kolmogorov-Smirnov test) can be used to
compare two samples. Specifically, the p-value of the test
statistic can be returned. The p-value is the probability of the
event that a value of the test statistic greater than or equal to
that observed occurs when both samples have a common
probability distribution. A standard critical level of significance is
0.05. Thus if the p-value is less than the critical level we
can conclude that the distributions governing the two
samples differ. A difference between the distributions
indicates that the feature may prove useful for classification.
It must be stressed that if there are only a few measurements
per class, a large p-value need not indicate distributional
similarity. However, if there are enough measurements such
that p-values lower than the critical level are at least possible,
then if the p-value is greater than the critical level we may
conclude $f_k$ is “approximately” affinely dependent on the
previous features. Specifically, eqn. (1) with $r_2 = f_k$ and
$r_1 = [f_1, f_2, \ldots, f_{k-1}]^T$ is satisfied at the reported p-value.
Graphical methods such as quantile-quantile plots can be
used to corroborate the conclusions.
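
A rough sketch of the residual test for two classes is given below. It computes the two-sample Kolmogorov-Smirnov statistic directly and, so that only NumPy is needed, swaps in a permutation-based p-value for the tabulated one; `scipy.stats.ks_2samp` would be the usual library route. The synthetic data, the class shift, the intercept column (playing the role of $b_k$ instead of standardization) and all function names are assumptions.

```python
import numpy as np


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Share of label shufflings with KS statistic >= the observed one."""
    rng = np.random.default_rng(seed)
    observed = ks_statistic(a, b)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        s = rng.permutation(pooled)
        if ks_statistic(s[:a.size], s[a.size:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


def residual_p_value(prev, fk, labels):
    """p-value comparing class-0 and class-1 residuals of fk."""
    A = np.column_stack([np.ones(len(fk)), prev])  # intercept ~ b_k
    coef, *_ = np.linalg.lstsq(A, fk, rcond=None)
    e = fk - A @ coef
    return permutation_p_value(e[labels == 0], e[labels == 1])


rng = np.random.default_rng(5)
n = 200
labels = np.repeat([0, 1], n)
prev = rng.normal(size=(2 * n, 2)) + labels[:, None]  # earlier features
base = prev @ np.array([1.0, -0.5])

# Case A: the residual of f_k carries extra class information.
fk_useful = base + rng.normal(size=2 * n) + 1.0 * labels
# Case B: the residual is class-independent noise.
fk_redundant = base + rng.normal(size=2 * n)

p_useful = residual_p_value(prev, fk_useful, labels)
p_redundant = residual_p_value(prev, fk_redundant, labels)
print(p_useful, p_redundant)
```

In case A the residual distributions differ between classes and the p-value falls well below 0.05. In case B the p-value is not driven toward zero, and a value above the critical level would mark the feature as a candidate for removal.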

The process of feature subset selection involves considering
various subsets and, via a pre-chosen separability measure
(a.k.a. Filter Method) or a classifier architecture (a.k.a.
Wrapper Method), ranking the quality of each subset [7],
[8]. As testing every possible subset is usually prohibitive,
alternatives that consider only specific subsets are almost
always applied. One such approach is as follows. Features
can be ranked according to p-value. If it is required that a
feature must be discarded, the feature with the largest
p-value can be selected. The entire process of Gram-Schmidt
orthogonalization followed by statistical testing is then
repeated if further reduction is required.

V.  SUMMARY 

Methodologies are provided to detect exact and “approximate”
affine relationships within a set of features. They
provide insight into feature structure and aid in feature subset
selection.


VI.  APPENDIX 

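Assume the affine dependence of eqn. (11) for feature $f_k$.

$$f_k = \sum_{j=1,\, j \neq k}^{D} a_j f_j + b_k \tag{11}$$

This dependence must be enforced regardless of class. Thus

$$x_k = \sum_{j=1,\, j \neq k}^{D} a_j x_j + b_k 1_N \tag{12}$$

where $x_j$ is the $j$th column of $X$ ($j = 1, \ldots, D$) and $1_N$ is
the $N$-element vector $[1\ 1\ \cdots\ 1]^T$.

Denote $x_j^T 1_N$, the sum of elements in column $j$, as $S_j$.
Denote the columns of $X$ after standardization as $\tilde{x}_j$. Thus
$\tilde{x}_j$ equals

$$\tilde{x}_j = \frac{x_j - \hat{\mu}_j 1_N}{\hat{\sigma}_j} \tag{13}$$

where $\hat{\mu}_j = S_j / N$ and $\hat{\sigma}_j^2$ is the sample variance of column $j$,
proportional to $(x_j - \hat{\mu}_j 1_N)^T (x_j - \hat{\mu}_j 1_N)$. Rewriting
eqn. (12) in terms of $\tilde{x}_j$ results in:

$$\hat{\sigma}_k \tilde{x}_k + \hat{\mu}_k 1_N = \sum_{j=1,\, j \neq k}^{D} a_j \left( \hat{\sigma}_j \tilde{x}_j + \hat{\mu}_j 1_N \right) + b_k 1_N \tag{14}$$

Subtracting $\hat{\mu}_k 1_N$ from both sides of (14) yields:

$$\hat{\sigma}_k \tilde{x}_k = \sum_{j=1,\, j \neq k}^{D} a_j \hat{\sigma}_j \tilde{x}_j + \underbrace{\left[ \sum_{j=1,\, j \neq k}^{D} a_j \hat{\mu}_j 1_N + b_k 1_N - \hat{\mu}_k 1_N \right]}_{\Gamma}$$

It is clear that the vectors $\sum_{j=1,\, j \neq k}^{D} a_j \hat{\mu}_j 1_N + b_k 1_N$ and
$\hat{\mu}_k 1_N$ can be represented as $c\,1_N$ and $d\,1_N$ respectively and
that $\Gamma = (c - d) 1_N$. Pre-multiplying both $\Gamma$ and eqn. (12)
by $1_N^T$ reveals that $c = d$. Thus

$$\tilde{x}_k = \sum_{j=1,\, j \neq k}^{D} \tilde{a}_j \tilde{x}_j \tag{15}$$

where $\tilde{a}_j$ equals $a_j \hat{\sigma}_j / \hat{\sigma}_k$.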
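
The result can be checked numerically. This sketch builds one affinely dependent column from two others (the coefficients, offset and sample size are arbitrary assumptions) and compares its standardized form with the prediction $\tilde{a}_j = a_j \hat{\sigma}_j / \hat{\sigma}_k$:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 120
x1 = rng.normal(size=N)
x2 = 3.0 * rng.normal(size=N)
a1, a2, bk = 0.7, -1.2, 4.0
xk = a1 * x1 + a2 * x2 + bk  # affine dependence as in eqn. (12)


def standardize(x):
    """Subtract the sample mean and divide by the sample std."""
    return (x - x.mean()) / x.std(ddof=1)


s1, s2, sk = x1.std(ddof=1), x2.std(ddof=1), xk.std(ddof=1)

# Coefficients predicted for the standardized columns.
a1_t, a2_t = a1 * s1 / sk, a2 * s2 / sk

lhs = standardize(xk)
rhs = a1_t * standardize(x1) + a2_t * standardize(x2)
max_err = np.max(np.abs(lhs - rhs))
print(max_err)
```

The offset $b_k$ plays no role in the standardized relationship, in agreement with $c = d$ above.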